Environment Setup Guide
By Hongyu Xiao
Contact: hongyu.xiao@ou.edu
Environment Setup Guide for Research Computing
This guide provides instructions for setting up your research computing environment. Whether you're a new researcher, student, or staff member, proper environment configuration is crucial for efficient computational work.
Purpose and Objectives
The main objectives of this environment setup are:
- Configure a consistent and reproducible computing environment
- Enable access to necessary research software and tools
- Establish proper paths and dependencies for computational tasks
- Ensure security and best practices in research computing
By following this guide, you'll have a fully functional research computing environment that meets your immediate needs and supports future scalability.
Configure your working environment:
- Shell configuration (.bashrc, .bash_profile)
# Example .bashrc configuration
# Custom aliases
alias ll='ls -la'
alias py='python3'
alias jn='jupyter notebook'
# Environment variables
export PATH=$HOME/bin:$PATH
export PYTHONPATH=$HOME/research/lib:$PYTHONPATH
Above is a basic example of a .bashrc configuration file that sets up common research computing environment elements, including aliases and environment variables.
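After editing your .bashrc, reload it so the changes take effect in your current shell session (a standard shell step, not specific to our cluster):
# Apply the updated configuration to the current shell
source ~/.bashrc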
# Example of my research computing aliases
# Jupyter notebook on specific port
alias jupyter_8989="jupyter notebook --no-browser --port=8989"
# Navigate to my research data directory
alias ogs_ourdisk="cd /ourdisk/hpc/ogs/hongyux/dont_archive/"
Note: The /ourdisk/hpc/ogs/ directory is our OGS server storage location. Each user has an automatically created folder under this path following the structure /ourdisk/hpc/ogs/yourusername/dont_archive/. This is where you should store your research data and files.
# Example of accessing your personal storage directory
# Replace 'yourusername' with your actual username
cd /ourdisk/hpc/ogs/yourusername/dont_archive/
# Creating a new directory for a specific project
mkdir /ourdisk/hpc/ogs/yourusername/dont_archive/project_name
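To keep an eye on how much space a project takes up, standard Unix tools work here too (the project directory name is just a placeholder):
# Report the total size of a project directory
du -sh /ourdisk/hpc/ogs/yourusername/dont_archive/project_name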
- Module system overview
Our HPC systems use the module system to manage software environments. To see available software modules, use:
module avail           # List all available modules
module list            # Show currently loaded modules
module load python     # Load Python module
module unload python   # Unload Python module
Common software modules include:
- Python (various versions)
- Compilers (gcc, intel)
- MPI libraries
- Machine learning frameworks (TensorFlow, PyTorch)
Note: Some modules may conflict with each other. Use 'module unload' before loading potentially conflicting modules.
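As a sketch of that unload-then-load pattern (the version strings here are hypothetical; check module avail for the real ones on our system):
# Unload the currently loaded version before switching
module unload python/3.9
# Load the version you actually need
module load python/3.11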
Using SLURM for Efficient Computing
While Jupyter notebook access through tunneling is available, using SLURM for job management often provides better efficiency and resource utilization. Here is my template for a GPU-enabled SLURM script for deep learning tasks:
#!/bin/bash
#SBATCH --partition=disc_dual_a100   # GPU partition
#SBATCH --gres=gpu:1                 # Request 1 GPU
#SBATCH --output=job_%J_.txt         # Output file
#SBATCH --error=job_%J_.txt          # Error file
#SBATCH --ntasks=1                   # Number of tasks
#SBATCH --mem=100G                   # Memory request
#SBATCH --time=24:00:00              # Time limit

# Run your deep learning script
python your_training_script.py
When using GPUs, make sure to specify the appropriate partition (disc_dual_a100) and request GPU resources using the --gres flag. This ensures your job gets scheduled on nodes with available GPUs.
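Before submitting, it can help to check the state of the GPU partition (using the partition name from the script above):
# Show node states and limits for the GPU partition
sinfo -p disc_dual_a100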
To submit your SLURM job, use:
sbatch your_script.sbatch
Common SLURM commands for job management:
squeue -u $USER   # Check your job queue
scancel job_id    # Cancel a specific job
sinfo             # Check partition information
This approach allows for better resource management and more efficient execution of computational tasks compared to interactive notebook sessions.
Here are examples of using squeue and grep to monitor jobs:
# View all jobs in the queue
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 disc_dual python_tr hongyux R 2:30:15 1 node001
123457 disc_dual tensor_jo user2 R 12:45:22 1 node002
123458 disc_dual pytorch_t user3 PD 0:00:00 1 (Resources)
# Filter jobs on disc partitions
$ squeue | grep disc
123456 disc_dual python_tr hongyux R 2:30:15 1 node001
123457 disc_dual tensor_jo user2 R 12:45:22 1 node002
123458 disc_dual pytorch_t user3 PD 0:00:00 1 (Resources)
123459 disc_a100 train_ml user4 R 5:12:33 1 node003
The output shows job ID, partition name, job name, user, status (R=running, PD=pending), runtime, number of nodes, and node assignment or reason for pending.
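For jobs stuck in the pending state, SLURM can also report an estimated start time:
# Ask SLURM for expected start times of your pending jobs
squeue -u $USER --start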
Advanced SLURM Usage Tips
Here are some additional SLURM commands and features that can help you manage your computational jobs more effectively:
1. Job Dependencies
You can make jobs wait for other jobs to complete:
# Wait for job 123456 to complete before starting
sbatch --dependency=afterok:123456 next_job.sh
# Wait for job 123456 to fail before starting
sbatch --dependency=afternotok:123456 cleanup_job.sh
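To chain jobs without copying IDs by hand, you can capture the job ID at submission time; a minimal sketch (the script names preprocess.sh and train.sh are hypothetical):
# --parsable makes sbatch print only the job ID
jobid=$(sbatch --parsable preprocess.sh)
# Start training only if preprocessing completed successfully
sbatch --dependency=afterok:$jobid train.sh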
2. Resource Monitoring
Monitor your job's resource usage:
sstat - View resource usage of running jobs
sacct - View completed job information
# View detailed job information
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed
# Monitor memory usage of running job
sstat --format=AveCPU,AveRSS,AveVMSize --jobs JobID
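Many SLURM installations also ship the seff utility for a post-run efficiency summary; if it is available on our cluster, usage looks like this:
# Summarize CPU and memory efficiency of a completed job
seff JobID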
Conda Setup
On OSCER you can load Miniconda to create separate research environments tailored to your needs; for machine learning work in particular, different Python packages may be required. Here is an example of setting up an environment.
# Step 1: Load the Miniconda module
module load Miniconda3
# Step 2: Verify Miniconda Installation
conda --version
# Step 3: Create a New Environment with a Specific Python Version
# Replace X.Y with the desired Python version (e.g., 3.9, 3.10, 3.11)
conda create --name myproject python=X.Y
# Step 4: Activate the New Environment
conda activate myproject
# Step 5: Verify Python Version
python --version
# Optional: Install Specific Packages
conda install numpy pandas matplotlib
# To deactivate the environment when done
conda deactivate
# Useful Additional Commands:
# List all environments
conda env list
# Remove an environment
conda env remove --name myproject
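To make an environment reproducible (one of the objectives at the top of this guide), you can export it and recreate it later or on another machine:
# Export the environment specification to a file
conda env export --name myproject > environment.yml
# Recreate the same environment from that file
conda env create --file environment.yml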
You can also install Conda in a location of your choice. Visit https://docs.conda.io/en/latest/miniconda.html to download the appropriate installer.
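A common pattern for a custom-location install, assuming a Linux x86_64 system (the installer filename may differ for your platform):
# Download the latest Miniconda installer for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install non-interactively (-b) into a custom prefix (-p)
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3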
CUDA/Seisbench Setup
For CUDA, you can use module load to load any module you want, including CUDA. By default, module load CUDA will load the most recent version of CUDA. For example:
[hongyux@schooner3 ~]$ module spider cuda
--------------------------------------------------------------------------------------------
CUDA:
--------------------------------------------------------------------------------------------
Description:
CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform
and programming model created by NVIDIA and implemented by the graphics processing
units (GPUs) that they produce. CUDA gives developers access to the virtual
instruction set and memory of the parallel computational elements in CUDA GPUs.
Versions:
CUDA/5.5.22-GCC-4.8.2
CUDA/7.5.18-GCC-4.9.3-2.25
CUDA/7.5.18
CUDA/8.0.44-GCC-4.9.3-2.25
CUDA/8.0.44-intel-2016a
CUDA/8.0.61_375.26-GCC-5.4.0-2.26
CUDA/9.1.85-GCC-6.4.0-2.28
CUDA/9.2.88
CUDA/10.1.105-GCC-8.2.0-2.31.1
CUDA/10.1.243-GCC-8.3.0
CUDA/11.0.2-GCC-9.3.0
CUDA/11.1.1-GCC-10.2.0
CUDA/11.3.1
CUDA/11.5.0
CUDA/11.7.0
CUDA/11.8.0
CUDA/12.0.0
CUDA/12.1.1
CUDA/12.2.0
CUDA/12.3.0
In this scenario, if you type module load CUDA, you will get the following:
[hongyux@schooner3 ~]$ module load CUDA
[hongyux@schooner3 ~]$ module list
Currently Loaded Modules:
1) binutils/2.38 2) M4/1.4.18 3) flex/2.6.4 4) CUDA/12.3.0
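If your framework requires an older toolkit, you can request a specific version from the list above instead of the default:
# Load a specific CUDA version rather than the newest one
module load CUDA/11.8.0
# Confirm which version is now loaded
module list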
Here is an example of installing SeisBench:
# SeisBench Installation Methods
# 1. Using pip (recommended for most users)
# Install a specific version
pip install seisbench==0.1.0
# Install the latest version
pip install seisbench
# 2. Conda Environment Installation
conda create -n seisbench python=3.9
conda activate seisbench
pip install seisbench
# 3. Additional Dependencies for Full Functionality
pip install torch torchvision torchaudio
pip install numpy pandas matplotlib
Here is an example of showing the installed SeisBench version:
(TL) [hongyux@schooner3 ~]$ pip show seisbench
Name: seisbench
Version: 0.7.0
Summary: The seismological machine learning benchmark collection
Home-page:
Author:
Author-email: Jack Woolam <jack.woollam@kit.edu>, Jannes Münchmeyer <munchmej@gfz-potsdam.de>
License: GPLv3
Location: /home/hongyux/.conda/envs/TL/lib/python3.12/site-packages
Requires: bottleneck, h5py, nest-asyncio, numpy, obspy, pandas, scipy, torch, tqdm
SeisBench also has a good releases page on GitHub; please do take advantage of it: https://github.com/seisbench/seisbench/releases
If your code is not running, first check that your SeisBench version and CUDA version are compatible.
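A quick one-liner to check the relevant versions from Python (torch.version.cuda reports the CUDA toolkit PyTorch was built against):
# Print PyTorch version, its CUDA build, GPU availability, and SeisBench version
python -c "import torch, seisbench; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), seisbench.__version__)"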